25 research outputs found
A reproducible approach with R markdown to automatic classification of medical certificates in French
In this paper, we report the ongoing developments of our first participation in the Cross-Language Evaluation Forum (CLEF) eHealth Task 1: "Multilingual Information Extraction - ICD10 coding" (Névéol et al., 2017). The task consists of labelling death certificates in French with international standard codes. In particular, we aimed to meet the goal of the "Replication track" of this task, which promotes the sharing of tools and the dissemination of solid, reproducible results.
An interactive two-dimensional approach to query aspects rewriting in systematic reviews. IMS unipd at CLEF eHealth task 2
A lexicon based approach to classification of ICD10 codes. IMS unipd at CLEF eHealth task 1
In search of the "gold nugget": A textometric study of the work of Milan Kundera
This study consists of an integrated linguistic analysis of the work of Milan Kundera, a Czech writer and naturalized French citizen. By integrated analysis, we mean a linguistic study carried out with both qualitative and quantitative methods. These methods belong to the field of textometry, a discipline whose objective is to analyse textual corpora through computer processing (Guiraud, 1960; Lebart, Salem, 1994; Pincemin, 2020). More generally, this work could be included in the field of stylometry, since the textometric analysis serves the "characterization of a style of writing" (Magri, 2010). The main objective of this research is to detect, by contrast, the elements that define Kundera's prose. To this end, two corpora were composed: a corpus of study and a reference corpus (Rastier, 2011). The first comprises almost all the texts of Kundera's Œuvre I, II (Gallimard, Pléiade). The second is representative of the French literary landscape of the period in which Kundera published his texts (1968-2013).
The corpora were first digitised and then examined using the textometry software Hyperbase (web and standard versions), which employs both classical statistical methods and deep learning techniques (CNN, convolutional neural network). This software allows various analyses at the lexical, morphosyntactic and semantic levels. In particular, the following elements were investigated: the vocabulary structure (frequency distribution, hapax, lexical richness, vocabulary diversity and lexical growth); morphological and syntactic aspects, examined through the lemmatized and tagged versions of the corpora; morphosyntactic and multidimensional patterns; and themes (lexical specificities, isotopies and recurring themes). These elements were examined in an endogenous analysis of the corpus of study and in a series of exogenous analyses comparing the corpus of study with the reference corpus. Comparative studies between Kundera's work and the contrastive norm represented by the reference corpus aim to factor out the linguistic characteristics shared with the literary language of the time in novels, essays and short stories, in order to bring out the elements of Kundera's prose that depart from the linguistic model of his contemporaries' literary language.
In addition, endogenous analyses of Kundera's work, made possible by the compilation of subcorpora, can account for linguistic constants that are independent of genre, period and language, as well as for linguistic variants that depend on diachronic, generic and linguistic variables. In conclusion, this study employs an integrated methodology (linguistics, literature, statistics, deep learning) with the aim of defining the prototypical features of Kundera's idiolect, that is, the most significant elements that distinguish his writing from that of a representative sample of contemporary French authors.
Étude textometrique de l'œuvre de Milan Kundera. À la recherche de la « pepite d'or »
This study consists of an integrated linguistic analysis of the work of Milan Kundera, a Czech writer and naturalized French citizen. By integrated analysis, we mean a linguistic study carried out with both qualitative and quantitative methods. These methods belong to the field of textometry, a discipline whose objective is to analyse textual corpora through computer processing (Guiraud, 1960; Lebart, Salem, 1994; Pincemin, 2020). More generally, this work could be included in the field of stylometry, since the textometric analysis serves the "characterization of a writing style" (Magri, 2010).
The main objective of this research is to detect, by contrast, the elements that define Kundera's prose. To this end, two corpora were composed: a corpus of study and a reference corpus (Rastier, 2011). The first comprises almost all the texts of Kundera's Œuvre I, II (Gallimard, Pléiade). The second is representative of the French literary landscape of the period in which Kundera published his texts (1968-2013). To compile the latter corpus, we selected the texts which, on the basis of certain criteria (literary prizes, literary studies, critics' commentaries), can be considered the most significant of that literary period.
The corpora were first digitised and then examined using the textometry software Hyperbase (web and standard versions), which employs both classical statistical methods and deep learning techniques (CNN, convolutional neural network).
This software allows various analyses at the lexical, morphosyntactic and semantic levels. In particular, the following elements were investigated: the vocabulary structure (frequency distribution, hapax, lexical richness, vocabulary diversity and lexical growth); morphological and syntactic aspects, examined through the lemmatized and tagged versions of the corpora; morphosyntactic and multidimensional patterns; and the lexical and thematic content (lexical specificities, isotopies and recurring themes).
These elements were examined in an endogenous analysis of the corpus of study and in a series of exogenous analyses comparing the corpus of study with the reference corpus. Comparative studies between Kundera's work and the contrastive norm represented by the reference corpus aim to factor out the linguistic characteristics shared with the literary language of the time in novels, essays and short stories, in order to bring out the elements of Kundera's prose that depart from the linguistic model of his contemporaries' literary language. In addition, endogenous analyses of Kundera's work, made possible by the creation of subcorpora, can account for stylistic constants that are independent of genre, period and language, as well as for linguistic variants that depend on diachronic, generic and linguistic variables.
In conclusion, this study employs an integrated methodology (linguistics, statistics, deep learning) with the aim of defining the prototypical features of Kundera's idiolect, that is, the most significant elements that distinguish his writing from that of a representative sample of contemporary French authors.